Chromosome-level genome assemblies of Nicotiana tabacum, Nicotiana sylvestris, and Nicotiana tomentosiformis

The Solanaceae species Nicotiana tabacum, an economically important crop plant cultivated worldwide, is an allotetraploid species that appeared about 200,000 years ago as the result of the hybridization of diploid ancestors of Nicotiana sylvestris and Nicotiana tomentosiformis. The previously published genome assemblies for these three species relied primarily on short reads, and the obtained pseudochromosomes only partially covered the genomes. In this study, we generated annotated de novo chromosome-level genomes of N. tabacum, N. sylvestris, and N. tomentosiformis, which contain 3.99 Gb, 2.32 Gb, and 1.74 Gb of sequence data, respectively, with 97.6%, 99.5%, and 95.9% anchored to chromosomes, and represent 99.2%, 98.3%, and 98.5% of the near-universal single-copy orthologous Solanaceae genes. The completion levels of these chromosome-level genomes for N. tabacum, N. sylvestris, and N. tomentosiformis are comparable to other reference Solanaceae genomes, enabling more efficient synteny-based cross-species research.


Background & Summary
The Nicotiana genus belongs to the Solanaceae family, which also includes tomato (Solanum lycopersicum), potato (Solanum tuberosum), and eggplant (Solanum melongena) 1,2. While most of the Solanaceae are diploids with 12 chromosome pairs, tobacco (Nicotiana tabacum L.) is an allotetraploid (2n = 4x = 48) resulting from a hybridization event that likely occurred in the Andes within the last 200,000 years between ancestors of Nicotiana sylvestris (S-genome; 2n = 2x = 24) and Nicotiana tomentosiformis (T-genome; 2n = 2x = 24) 3,4. In addition to being a modern descendant of the N. tabacum maternal progenitor, N. sylvestris, which is nowadays largely cultivated as an ornamental plant, is also one of the closest descendants of the ancestral species from the Alatae/Sylvestres section that hybridized as the paternal donor with an ancestral species from the Noctiflorae/Petunioides section to give rise to the almost all-Australian clade of allopolyploid species constituting the Nicotiana section Suaveolentes 5.
Similar to other members of the Nicotiana genus, N. sylvestris, N. tomentosiformis, and N. tabacum produce a wide range of alkaloids that are known to be toxic to insects and are a well-established mechanism of defense against herbivores 6. While N. sylvestris accumulates similar amounts of alkaloids in roots and leaves (3.5 mg/g in roots and 2.1 mg/g in leaves), N. tomentosiformis accumulates more alkaloids in roots (8.8 mg/g in roots and 0.6 mg/g in leaves), and N. tabacum has more in leaves (1.3 mg/g in roots and 12.5 mg/g in leaves) 7. The composition of the accumulated alkaloids varies between the three species, with N. tabacum benefiting from both of its progenitors' genetic and regulatory contributions. In N. sylvestris roots, 87% of the alkaloids is nicotine, 11% is anatabine, and 1.9% is anabasine, while in leaves, 100% of the alkaloids is nicotine. In N. tomentosiformis roots, 56% of the alkaloids is nornicotine, 28% is anatabine, 14% is nicotine, 1.6% is anabasine, and 0.57% is cotinine, while in leaves, 73% of the alkaloids is nicotine and 27% is nornicotine. In N. tabacum roots, 87% of the alkaloids is nicotine and 13% is nornicotine, while in leaves, 92% of the alkaloids is nicotine, 5.1% is nornicotine, and 2.6% is anatabine 7.
The Nicotiana genus is also a rich source of terpenoids, which play a significant role as attractants to several pollinator insects. In N. tabacum, both cembranoid and labdanoid diterpenoids are synthesized in the trichome glands, whereas N. sylvestris produces predominantly cembranoid diterpenoids and N. tomentosiformis predominantly labdanoid diterpenoids 8.
Although several Nicotiana species genomes have been published in the last decade, including for N. sylvestris 9, N. tomentosiformis 9, and N. tabacum 10,11, these genomes are primarily based on the assembly of second-generation sequencing data and therefore suffer from substantial fragmentation, resulting in only partial anchoring to chromosomes.
In the present study, we integrated Illumina short-read sequencing (Illumina, San Diego, CA, USA) with third-generation Oxford Nanopore long-read sequencing and Oxford Nanopore chromosome conformation capture (PoreC) technology (Oxford Nanopore Technologies, Oxford, UK) to generate high-quality chromosome-level reference genomes for N. tabacum, N. sylvestris, and N. tomentosiformis. These new resources will broaden our understanding of the contributions of both N. tabacum progenitors to the genes and the pathways of tobacco and enable more efficient synteny-based cross-species Solanaceae research.

Methods
DNA Extraction and Sequencing. Young leaves from N. tabacum L. cultivar K326 (PVY-resistant, derived from USDA ARS GRIN Global NPGS: PI 552505), N. sylvestris Speg. TW136 (USDA ARS GRIN Global NPGS: PI 555569), and N. tomentosiformis Goodsp. TW142 (USDA ARS GRIN Global NPGS: PI 555572) were snap-frozen with liquid nitrogen and finely ground in a mortar. High molecular weight genomic DNA for long-read sequencing was extracted using the Promega Wizard HMW DNA Extraction Kit (Promega AG, Madison, WI, USA).
Short genomic DNA fragments were depleted using Circulomics short-read eliminator kits from PacBio (PacBio, Menlo Park, CA, USA), and long-read sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 139 Gb of raw data were collected for N. tabacum, 159 Gb for N. sylvestris, and 76 Gb for N. tomentosiformis.
To conduct chromosome-level assembly, frozen leaves were cut into one square centimeter pieces and treated with formaldehyde to fix the DNA. The fixed genomic DNA was then digested overnight using the NlaIII restriction enzyme, and the 3′ overhangs were re-ligated using T4 ligase before extraction. PoreC sequencing libraries were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kits before sequencing on Oxford Nanopore Technologies PromethION R9.4.1 flowcells. About 40 Gb of raw data were collected for N. tabacum, 66 Gb for N. sylvestris, and 63 Gb for N. tomentosiformis. To polish and validate the assembled genomes, Illumina short reads were prepared for N. tabacum using Tecan Celero EZ DNA-Seq Library Preparation Kits (Tecan, Männedorf, Switzerland) and sequenced as 2 × 151 bp paired-end reads on an Illumina NovaSeq 6000 to generate a total of 139 Gb. Illumina short reads from ERR274527 12 and ERR274528 13 for N. sylvestris and from ERR274540 14 and ERR274542 15 for N. tomentosiformis were retrieved from the Short Read Archive.
De novo assembly and Chromosome Construction. For N. tabacum, Oxford Nanopore basecalling was performed with Guppy 6.3.7 using the plant super model. Long-read sequences were filtered using seqkit 16 2.2.0 to remove short reads (length <5000) and low-quality reads (average qscore <9), resulting in 98 Gb (N50 length: 28.5 kb).
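The read filter and the N50 statistic described above can be illustrated with a minimal Python sketch. It is not the seqkit implementation: the function names are hypothetical, and averaging error probabilities rather than raw Phred scores is an assumption about how the average qscore is defined.

```python
import math


def mean_qscore(quality_string, offset=33):
    """Mean read quality, computed from the mean error probability.

    Assumption: the average is taken over error probabilities, not over
    the raw Phred scores; the text does not state which definition was used.
    """
    probabilities = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return -10 * math.log10(sum(probabilities) / len(probabilities))


def keep_read(sequence, quality_string, min_length=5000, min_qscore=9):
    """Keep reads of at least min_length bases with a mean quality of at least min_qscore."""
    return len(sequence) >= min_length and mean_qscore(quality_string) >= min_qscore


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

The thresholds default to the values given in the text (5000 bases, qscore 9); any other cut-offs can be passed explicitly.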
Genomes were assembled using flye 17 2.9.1 with the nano-hq input pre-set and a read error rate of 0.03. The Illumina short reads were processed for each species using fastp 18 0.23.2 to trim adapters and low-quality bases, merge pairs, and remove low-complexity and short (length <75) reads. During processing, the reads were split into two sets: one for assembly polishing, which contained 80% of the processed Illumina reads, and one for assembly validation, which contained the remaining 20%.
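The text does not state how the 80%/20% partition was drawn. As a purely illustrative sketch, assuming an independent random assignment per read with a fixed seed, the split could look as follows (the function name and the seed are hypothetical):

```python
import random


def split_reads(read_ids, polish_fraction=0.8, seed=42):
    """Assign each read to the polishing or the validation set.

    Assumption: each read is assigned independently at random, so the
    80/20 proportion holds only approximately for small inputs.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    polishing, validation = [], []
    for read_id in read_ids:
        target = polishing if rng.random() < polish_fraction else validation
        target.append(read_id)
    return polishing, validation
```

Keeping the two sets disjoint matters because the validation reads are later used to score an assembly that was polished with the other set.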
Illumina short reads were mapped to the assembly contigs using minimap2 21,22 2.24, duplicates were marked with samblaster 23 0.1.26, and the alignments were filtered using samtools 24 1.15.1. The coverage of the assembly contigs by Illumina sequencing was then calculated using samtools 24 1.15.1, and contigs with less than 70% of their length covered at a depth of at least 5 for N. tabacum and 15 for N. sylvestris and N. tomentosiformis were removed.
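The contig retention rule can be written down compactly. The sketch below assumes that per-base depths are already available as a list of integers (for example, parsed from a depth table); the function names are hypothetical.

```python
def covered_fraction(depths, min_depth):
    """Fraction of contig positions with a depth of at least min_depth."""
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)


def keep_contig(depths, min_depth, min_fraction=0.70):
    """Keep a contig when at least min_fraction of its length reaches min_depth."""
    return covered_fraction(depths, min_depth) >= min_fraction
```

Following the text, min_depth would be 5 for N. tabacum and 15 for N. sylvestris and N. tomentosiformis.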
Because the biological material used for sequencing originated from inbred plants that can be considered homozygous, variants were called using freebayes 25 1.3.6 with the ploidy parameter set to 1 and ignoring sites with coverage higher than 200, and filtered with vcflib 26 1.0.3 vcffilter using the parameters --filter-sitesinfo --filter […]. Variants were then applied to the genomes using bcftools 24 1.15.1 consensus to generate the polished assembly contigs. Assembly contigs from plastid and mitochondrion were removed by mapping the polished assembly contigs to the N. tabacum plastid and mitochondrion sequences (NC_001879.2 27 and NC_006581.1 28, respectively) using minimap2 21,22 2.24 and filtering out contigs mapping over more than 50% of their length.
Assembly contigs from possible contamination were identified using kraken2 29 2.1.2 with the k2_plus-pfp_20220908 database 30 and removed by only retaining contigs identified as belonging to Nicotiana or Solanum species.
PoreC reads were mapped to the cleaned assembly contigs using minimap2 21,22 2.24. Alignments with a mapping quality lower than 60 for N. tabacum and 30 for N. sylvestris and N. tomentosiformis were discarded, and contact pairs were created from the remaining alignments. The positions on the contigs of each contact pair were recorded as two consecutive lines in a BED file. The scaffolding of the contigs to a chromosome-level assembly was performed using yahs 31 1.2a1. Contact maps were prepared using PretextMap 32 0.1.9, manually curated and annotated in PretextView 33 0.2.5, and the resulting scaffolds exported as chromosome-level sequences.
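A single PoreC read can align to more than two loci, so it has to be decomposed into pairwise contacts before being written as two consecutive BED lines. The following sketch assumes that every pair of retained alignments from the same read counts as one contact and uses a hypothetical naming scheme for the pairs; neither detail is specified in the text.

```python
from itertools import combinations


def contact_pairs_to_bed(alignments, min_mapq):
    """Turn multi-way read alignments into pairwise contacts in BED form.

    alignments: iterable of (read_id, contig, start, end, mapq) tuples.
    Assumption: all pairwise combinations of a read's retained alignments
    are emitted; the pair names (read_index/mate) are illustrative only.
    """
    by_read = {}
    for read_id, contig, start, end, mapq in alignments:
        if mapq >= min_mapq:  # discard alignments below the quality threshold
            by_read.setdefault(read_id, []).append((contig, start, end))

    lines = []
    for read_id, segments in by_read.items():
        for pair_index, pair in enumerate(combinations(segments, 2)):
            name = f"{read_id}_{pair_index}"
            # The two ends of a contact are written as consecutive lines.
            for mate, (contig, start, end) in enumerate(pair, start=1):
                lines.append(f"{contig}\t{start}\t{end}\t{name}/{mate}")
    return lines
```

A read with three retained alignments therefore yields three contacts and six BED lines, while a read with a single retained alignment yields none.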
To name and orient the N. tabacum chromosome-level sequences, the PT markers, mapped to the sequences using hisat2 34 2.2.1, were used together with the tobacco genetic map 35. Similarly, the N. tomentosiformis chromosome-level sequences were named and oriented using the N genetic map 36 combined with the tobacco PT markers 35. The chromosome-level assembly of the N. tomentosiformis genome was then used as a reference to name and orient the N. sylvestris chromosome-level sequences based on minimap2 21,22 2.24 mapping (Fig. 1).
The proportion of the assembly anchored to chromosomes reached 99.5%, 95.9%, and 97.6% of the total assembly lengths for N. sylvestris, N. tomentosiformis, and N. tabacum, respectively (Table 1).
When compared to the previously available N. tabacum genome assembly 11 generated from short-read sequencing, whole genome profiling, and optical and genetic mapping data, the new N. tabacum genome assembly has fewer contigs (decrease from 1,257,801 to 1,410) with a larger N50 length (increase from 9.1 kb to 11.8 Mb), and the proportion of the assembly anchored to chromosomes consequently improved from 64% to 97.6%.
Retrotransposon prediction and annotation. Nested retrotransposons were annotated by iteratively running genometools 1.6.2 ltrharvest 37 using the parameters -similar 70 -seed 20 -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 15000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 3 -vic 10 -overlaps best, retaining the predictions matching the RepeatExplorer Viridiplantae 3.0 dataset 38 using diamond 39 2.1.6 blastx with the parameters --max-target-seqs 1 --ultra-sensitive --frameshift 15, and excising them from the assembly using samtools 24 1.17. At most 20 prediction-filtering-excision iterations were performed. The predicted retrotransposons were classified by their homology to the RepeatExplorer Viridiplantae 3.0 dataset 38 sequences. Their age (T) was estimated under the assumption that their long terminal repeats (LTRs) were identical at the time of insertion by aligning their 3′ and 5′ LTRs using clustalo 40,41 1.2.4, calculating their divergence (K) using the Kimura-2-parameter distance, and dividing it by twice the substitution rate (r) of 1.5 × 10⁻⁸ substitutions per site per year 42, i.e., T = K/(2r).
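The age estimate can be illustrated with a short Python sketch. It applies the standard Kimura-2-parameter formula, K = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions, to two already aligned LTR sequences. The function names are hypothetical, and skipping gapped or ambiguous columns is an assumption about how such positions are handled.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def kimura_2p_distance(ltr5, ltr3):
    """Kimura-2-parameter distance between two aligned LTR sequences.

    Assumption: alignment columns with gaps or ambiguous bases are skipped.
    """
    sites = transitions = transversions = 0
    for base5, base3 in zip(ltr5.upper(), ltr3.upper()):
        if base5 not in "ACGT" or base3 not in "ACGT":
            continue
        sites += 1
        if base5 == base3:
            continue
        if (base5, base3) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites in the alignment")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age_years(divergence, rate=1.5e-8):
    """Insertion age T = K / (2r); both LTRs accumulate substitutions independently."""
    return divergence / (2 * rate)
```

With the rate used in the text, a divergence of K = 0.03 corresponds to an insertion roughly one million years ago.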
The predicted retrotransposons covered 26.6%, 32.2%, and 29.3% of the N. sylvestris, N. tomentosiformis, and N. tabacum genomes, respectively (Table 2). Regardless of the species, the most frequent element subclass is Ty3/gypsy|chromovirus|Tekay, representing between 40% and 56% of the total predicted retrotransposon length. The only element subclass that shows a marked difference between the three species is Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre, which covers 116,167,517 bp (18.8% of the total predicted retrotransposon length) in N. sylvestris, and only 21,672,795 bp (3.9%) in N. tomentosiformis. In N. tabacum, it covers 135,653,424 bp (11.6%), close to the sum of its coverage in the two precursor species (137,840,312 bp). Looking at the predicted insertion ages, a recent expansion of the Alesia and Angela subclasses of Ty1/copia and of the Ogre subclass of Ty3/gypsy retrotransposons in N. sylvestris and N. tabacum, but not in N. tomentosiformis, is observed (Fig. 2).
Coding-gene prediction and annotation. Genomes were masked using blast 43,44 2.14.0 windowmasker with dusting, and augustus 45 3.5.0 was used for gene prediction. A training dataset was created by separately mapping S. lycopersicum, S. tuberosum, and Nicotiana attenuata cDNA and CDS from Ensembl 56 using minimap2 21,22 2.26 to the N. sylvestris and N. tomentosiformis genomes. Any sequence with an annotation matching 'hypothetical', 'unknown', 'polyprotein', 'domain-containing', 'chloroplast', or 'mitochondria' was omitted from the mapping. Gene models were constructed from the mapped sequences using bedtools 46 2.30.0 and filtered using gffread 47 0.12.7 with the parameters -V -H -U -N -P -J -M -K -Q -Y -Z -F --keep-exon-attrs. Training sequences were then extracted from the genomes using the obtained GFF annotation file and adding 1,000 bp flanking regions. One-fourth of the gene models were set aside for testing for each combination of species and dataset. After merging the training and testing datasets, a Nicotiana model was trained using the etraining and optimize_augustus.pl programs bundled with augustus 45 3.5.0. A total of 10,092 loci were used for training, and 3,362 loci were used for testing.
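Two of the preparation steps above, the annotation keyword filter and the addition of 1,000 bp flanking regions, are simple enough to sketch in Python. The function names are hypothetical; case-insensitive substring matching and 0-based half-open coordinates clamped to the sequence bounds are assumptions.

```python
EXCLUDED_KEYWORDS = (
    "hypothetical",
    "unknown",
    "polyprotein",
    "domain-containing",
    "chloroplast",
    "mitochondria",
)


def is_usable_annotation(description):
    """Return False when the description matches any excluded keyword.

    Assumption: matching is a case-insensitive substring search.
    """
    text = description.lower()
    return not any(keyword in text for keyword in EXCLUDED_KEYWORDS)


def with_flanks(start, end, sequence_length, flank=1000):
    """Extend a 0-based half-open interval by `flank` bases on each side.

    Assumption: the extended interval is clamped to the sequence bounds.
    """
    return max(0, start - flank), min(sequence_length, end + flank)
```

For example, `with_flanks(5000, 6000, 100000)` returns `(4000, 7000)`, whereas a locus near a sequence end receives a shorter flank on that side.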
To complement the augustus predictions, additional gene models were created by separately mapping the predicted N. sylvestris, N. tomentosiformis, and N. tabacum cDNA and CDS and the S. lycopersicum, S. tuberosum, and N. attenuata cDNA and CDS from Ensembl 56 to the genomes using minimap2 21,22 2.26. Models that overlapped augustus predictions by 25% or more according to bedtools 46 2.30.0 intersect were then filtered out by ID using gffread 47 0.12.7 with the parameters -P -M -K -Q -Y -Z -F, and the remaining gene models were added to those predicted with augustus 45 3.5.0.
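The overlap criterion can be sketched as follows. The example assumes that the 25% threshold refers to the fraction of the mapped model covered by a single augustus prediction, and that all intervals lie on the same sequence and strand; both points are assumptions, and the function names are hypothetical.

```python
def overlap_fraction(model, prediction):
    """Fraction of the model interval covered by the prediction interval.

    Intervals are (start, end) tuples in 0-based half-open coordinates.
    """
    start, end = model
    pred_start, pred_end = prediction
    overlap = max(0, min(end, pred_end) - max(start, pred_start))
    return overlap / (end - start)


def complementary_models(models, predictions, max_overlap=0.25):
    """Keep models overlapping every prediction by less than max_overlap."""
    return [
        model
        for model in models
        if all(overlap_fraction(model, pred) < max_overlap for pred in predictions)
    ]
```

A model that is overlapped by exactly 25% is removed, in line with the "25% or more" wording.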

Data Records
The genomes and annotations are available from Zenodo under records 8256252 75, 8256254 76, and 8256256 77. The trained Nicotiana model for augustus gene prediction is available from Zenodo under record 8256280 78. The genomes have been deposited at DDBJ/ENA/GenBank under the accessions ASAF00000000 79, ASAG00000000 80, and AWOJ00000000 81.

Technical Validation
The quality and completeness of the assemblies were assessed with yak 105 0.1 using the 20% of the processed Illumina short reads that were set aside for that purpose. For N. tabacum, a Quality Coverage of 0.982 and a Quality Value of 38.1 were obtained; for N. sylvestris, the values were 0.993 and 41.5; and for N. tomentosiformis, 0.991 and 43.2.
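To put the Quality Values into perspective, they can be converted to per-base error rates, assuming that they are expressed on the usual Phred scale (QV = -10 log10(error rate)). The function names below are hypothetical.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Phred-scaled quality value implied by a per-base error rate."""
    return -10 * math.log10(error_rate)
```

Under this assumption, a Quality Value of 40 corresponds to about one error per 10,000 bases, so the reported values of 38.1 to 43.2 bracket that level.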
The quality of the gene predictions from the trained Nicotiana model was evaluated using the prepared testing sets and compared with the results obtained using the already available arabidopsis, tomato, and coyote_tobacco models (Table 3).
The completeness of the gene model sets was evaluated using BUSCO 106 5.4.7 with the solanales_odb10 lineage dataset. Completeness of 98.1%, 95.1%, and 96.1% at the transcript level and of 97.0%, 92.8%, and 93.4% at the protein level were obtained for N. tabacum, N. sylvestris, and N. tomentosiformis, respectively (Table 4). These values are similar to those obtained for S. lycopersicum: 95.0% at the transcript level and 92.3% at the protein level.

Fig. 1 PoreC contact maps. Intra-chromosomal and inter-chromosomal contacts are shown for the Nicotiana sylvestris, Nicotiana tomentosiformis, and Nicotiana tabacum genome assemblies. The black bottom and right edges correspond to unplaced sequences.

Fig. 2 Predicted retrotransposon insertion ages. (a) Predicted insertion ages in millions of years for retrotransposons of the Ty1/copia superfamily; (b) Predicted insertion ages in millions of years for retrotransposons of the Ty3/gypsy superfamily.

Table 1. Chromosome length, total assembly length, and percentage of the assembly anchored to chromosomes for Nicotiana sylvestris, Nicotiana tomentosiformis, and Nicotiana tabacum.

Table 2. Predicted retrotransposon length and genome coverage statistics.

Table 3. Augustus testing metrics with the arabidopsis, tomato, coyote_tobacco, and Nicotiana models.

Table 4. Statistics of the BUSCO genome, transcript, and protein completeness evaluation using the solanales_odb10 lineage dataset for Nicotiana sylvestris, Nicotiana tomentosiformis, and Nicotiana tabacum.